How do clothing companies decide the sizes of their T-shirts?
What size should a T-shirt be?
Everyone's true T-shirt size is different, so how do companies arrive at the XS, S, M, L and XL labels?
At first there are no such labels, only information about customers. Think about how a tailor makes a customer-tailored T-shirt: they measure your neck width (collar), arm length, chest width, waistline and so on. A clothing company, by contrast, wants as few sizes as possible in order to save cost. Let's say it wants only five. The problem is then to find five sizes such that most customers can buy a comfortable T-shirt, and the right size is neither too large nor too small. In statistics, this problem is equivalent to finding five clusters based on the provided measurements so that the variation within clusters is small, while the variation between clusters is large.
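The T-shirt problem can be sketched directly with k-means on simulated body measurements (all numbers below are invented purely for illustration):

```r
# Simulate chest and arm measurements (cm) for 300 hypothetical customers
set.seed(1)
chest <- rnorm(300, mean = 100, sd = 8)
arm   <- rnorm(300, mean = 60,  sd = 4)
body  <- scale(cbind(chest, arm))  # standardize so both measurements count equally

# Ask k-means for five "sizes"; each cluster center is one prototype T-shirt
fit <- kmeans(body, centers = 5, nstart = 20)
table(fit$cluster)  # how many customers fall into each size
fit$centers         # the five prototype measurements (standardized)
```

Each center plays the role of one size label; a customer is assigned the size whose prototype measurements are closest to their own.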
This tutorial illustrates the implementation of clustering and association rule mining in R.
We use the seeds data set to demonstrate cluster analysis in R. The examined group comprised kernels belonging to three different varieties of wheat: Kama, Rosa and Canadian, 70 elements each. A description of the dataset can be viewed at (https://archive.ics.uci.edu/ml/datasets/seeds)
seed = read.table('http://archive.ics.uci.edu/ml/machine-learning-databases/00236/seeds_dataset.txt', header=F)
seed = seed[,1:7]
colnames(seed) = c("area", "perimeter","campactness", "length", "width", "asymmetry", "groovelength")
Scale the data to have zero mean and unit variance for each column.
seed <- scale(seed)
Use k-means method for clustering and plot results.
# K-Means Cluster Analysis
fit <- kmeans(seed, 5) #5 cluster solution
#Display the number of observations in each cluster
table(fit$cluster)
##
## 1 2 3 4 5
## 39 56 31 39 45
#Plot the clusters found by kmeans
install.packages("fpc")
library(fpc)
plotcluster(seed, fit$cluster)
#See exactly which items are in the 1st cluster
seed[fit$cluster==1,]
#get cluster means for scaled data
aggregate(seed,by=list(fit$cluster),FUN=mean)
## Group.1 area perimeter campactness length width
## 1 1 -1.1196277 -1.0357288 -1.5439732 -0.8332892 -1.3254203
## 2 2 1.3885128 1.3867754 0.5989659 1.3721886 1.2746370
## 3 3 -0.8435994 -0.8992987 -0.1233505 -0.9202448 -0.7034455
## 4 4 -0.4882294 -0.5527356 0.3850651 -0.6738515 -0.3016514
## 5 5 0.2466954 0.2704211 0.3439821 0.2325225 0.3085097
## asymmetry groovelength
## 1 0.5793810 -0.49143957
## 2 -0.0910015 1.39643675
## 3 1.1433053 -0.71050726
## 4 -0.8964679 -0.99728952
## 5 -0.3995553 0.04190449
#or alternatively, use the output of kmeans
fit$centers
## area perimeter campactness length width asymmetry
## 1 -1.1196277 -1.0357288 -1.5439732 -0.8332892 -1.3254203 0.5793810
## 2 1.3885128 1.3867754 0.5989659 1.3721886 1.2746370 -0.0910015
## 3 -0.8435994 -0.8992987 -0.1233505 -0.9202448 -0.7034455 1.1433053
## 4 -0.4882294 -0.5527356 0.3850651 -0.6738515 -0.3016514 -0.8964679
## 5 0.2466954 0.2704211 0.3439821 0.2325225 0.3085097 -0.3995553
## groovelength
## 1 -0.49143957
## 2 1.39643675
## 3 -0.71050726
## 4 -0.99728952
## 5 0.04190449
Here a simple within-group sum of squares (elbow) method is used to determine the number of clusters.
# Determine number of clusters
wss <- (nrow(seed)-1)*sum(apply(seed,2,var))
for (i in 2:12) wss[i] <- sum(kmeans(seed, centers=i)$withinss)
plot(1:12, wss, type="b", xlab="Number of Clusters",ylab="Within groups sum of squares")
Prediction strength can also be used to estimate the number of clusters. It is defined in Tibshirani and Walther (2005), who recommend choosing as the optimal number of clusters the largest number that yields a prediction strength above 0.8 or 0.9.
prediction.strength(seed, Gmin=2, Gmax=15, M=10,cutoff=0.8)
## Prediction strength
## Clustering method: kmeans
## Maximum number of clusters: 15
## Resampled data sets: 10
## Mean pred.str. for numbers of clusters: 1 0.8470612 0.7778921 0.4833412 0.4015382 0.3623321 0.3461767 0.2895573 0.2337563 0.257171 0.1936788 0.1565038 0.17217 0.1658633 0.1208181
## Cutoff value: 0.8
## Largest number of clusters better than cutoff: 2
The fpc package has a cluster.stats() function that can calculate other cluster validity measures, such as the average silhouette width (between -1 and 1; the higher the better) or the Dunn index (between 0 and infinity; the higher the better):
d = dist(seed, method = "euclidean")
result = matrix(nrow = 14, ncol = 3)
for (i in 2:15){
cluster_result = kmeans(seed, i)
clusterstat=cluster.stats(d, cluster_result$cluster)
result[i-1,1]=i
result[i-1,2]=clusterstat$avg.silwidth
result[i-1,3]=clusterstat$dunn
}
plot(result[,c(1,2)], type="l", ylab = 'silhouette width', xlab = 'number of clusters')
plot(result[,c(1,3)], type="l", ylab = 'dunn index', xlab = 'number of clusters')
The package NbClust provides 30 indexes for determining the optimal number of clusters in a data set.
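A minimal sketch of how NbClust might be called on the seeds data (argument names follow the package's documented interface; index = "silhouette" is just one of the 30 available indexes, and the data is rebuilt here only so the chunk runs on its own):

```r
# install.packages("NbClust")  # if not already installed
library(NbClust)

# rebuild the scaled seeds data (same as above) so this chunk is self-contained
seed <- scale(read.table(
  'http://archive.ics.uci.edu/ml/machine-learning-databases/00236/seeds_dataset.txt')[, 1:7])

nb <- NbClust(seed, distance = "euclidean", min.nc = 2, max.nc = 10,
              method = "kmeans", index = "silhouette")
nb$Best.nc  # the number of clusters this index suggests, with the index value
```

With index = "all", NbClust reports all 30 indexes and a majority vote on the best number of clusters.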
For more sophisticated methods, see for example this blog, or these course notes.
This article on Cross Validated provides a great illustration of the situations when k-means would fail.
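One classic failure case is easy to simulate: two concentric rings. k-means looks for compact, roughly spherical clusters around centroids, so it slices the rings in half instead of separating them (a synthetic example, not from the tutorial's data):

```r
set.seed(2)
theta  <- runif(400, 0, 2 * pi)
radius <- rep(c(1, 5), each = 200)                 # inner and outer ring
rings  <- cbind(radius * cos(theta), radius * sin(theta)) +
          matrix(rnorm(800, sd = 0.1), ncol = 2)   # small jitter
truth  <- rep(1:2, each = 200)

km <- kmeans(rings, centers = 2, nstart = 20)
table(truth, km$cluster)  # each k-means cluster mixes points from both rings
plot(rings, col = km$cluster, pch = truth)
```

Density- or model-based methods (or k-means after a suitable transformation, e.g. to polar coordinates) handle such shapes better.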
#Ward's method for hierarchical clustering
#Calculate the distance matrix
seed.dist=dist(seed)
#Obtain clusters using Ward's method
seed.hclust=hclust(seed.dist, method="ward.D") # "ward" was renamed to "ward.D" in newer versions of R
plot(seed.hclust)
#Cut dendrogram at the 3 clusters level and obtain cluster membership
seed.3clust = cutree(seed.hclust,k=3)
#See exactly which items are in the third cluster
seed[seed.3clust==3,]
#Get cluster means for the three clusters (seed was scaled above, so these are standardized means)
aggregate(seed, by=list(cluster=seed.3clust), FUN=mean)
#Centroid plot against the first 2 discriminant functions
#Load the fpc library needed for plotcluster function
library(fpc)
plotcluster(seed, seed.3clust)
A newer clustering approach, model-based clustering, treats the clustering problem as maximizing a normal mixture model. Generating an observation in this model consists of first picking a centroid (the mean of a multivariate normal distribution) at random and then adding some noise (variance). If the noise is normally distributed, this procedure results in clusters of spherical shape. Model-based clustering assumes that the data were generated by such a model and tries to recover the original model from the data. The recovered model then defines the clusters and the assignment of observations to clusters. It can be thought of as a generalization of \(K\)-means.
The model "recovering" process is done via the Expectation-Maximization (EM) algorithm, an iterative approach to maximizing the likelihood of a statistical model that contains unobserved (latent) variables.
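To make the E and M steps concrete, here is a bare-bones EM for a two-component univariate normal mixture (a hand-rolled sketch for intuition only; the Mclust() call below does the real, multivariate version):

```r
set.seed(3)
x <- c(rnorm(150, mean = 0, sd = 1), rnorm(150, mean = 6, sd = 1))

# crude starting values
mu <- c(-1, 1); sigma <- c(1, 1); p <- c(0.5, 0.5)

for (iter in 1:100) {
  # E-step: responsibility of component 1 for each point
  d1 <- p[1] * dnorm(x, mu[1], sigma[1])
  d2 <- p[2] * dnorm(x, mu[2], sigma[2])
  g  <- d1 / (d1 + d2)
  # M-step: responsibility-weighted parameter updates
  mu    <- c(weighted.mean(x, g), weighted.mean(x, 1 - g))
  sigma <- c(sqrt(weighted.mean((x - mu[1])^2, g)),
             sqrt(weighted.mean((x - mu[2])^2, 1 - g)))
  p     <- c(mean(g), mean(1 - g))
}
sort(round(mu, 1))  # recovered component means, close to the true 0 and 6
```

Each iteration first computes how likely each point is to belong to each component (E-step), then re-estimates means, variances and mixing weights using those probabilities as weights (M-step).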
One obvious advantage of the approach is that we can treat the question “How Many Clusters?” as a model selection problem.
For a detailed description of the method and the package, see 1 and 2
install.packages('mclust')
library(mclust)
mclust_result = Mclust(seed)
summary(mclust_result)
## ----------------------------------------------------
## Gaussian finite mixture model fitted by EM algorithm
## ----------------------------------------------------
##
## Mclust EEV (ellipsoidal, equal volume and shape) model with 4 components:
##
## log.likelihood n df BIC ICL
## 416.1639 210 122 179.9808 172.0903
##
## Clustering table:
## 1 2 3 4
## 44 50 67 49
The BIC used in the package is the negative of the ‘usual’ BIC from our discussion of regression models. Therefore we are trying to maximize the BIC here.
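Because Mclust() evaluates BIC for every covariance model and number of components, the "How many clusters?" question can be read straight off its model-selection output (the model is refit here only so the chunk runs on its own; normally you would reuse mclust_result from above):

```r
library(mclust)

# rebuild the scaled seeds data and refit, as in the chunks above
seed <- scale(read.table(
  'http://archive.ics.uci.edu/ml/machine-learning-databases/00236/seeds_dataset.txt')[, 1:7])
mclust_result <- Mclust(seed)

mclust_result$BIC                  # BIC for each model type and component count
plot(mclust_result, what = "BIC")  # the fitted model sits at the maximum
```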
plot(mclust_result)
Association Rules is a popular and well researched method for discovering interesting relations between variables in large databases. arules package in R provides a basic infrastructure for creating and manipulating input data sets and for analyzing the resulting itemsets and rules.
For an introduction to arules and additional case studies see (http://cran.r-project.org/web/packages/arules/vignettes/arules.pdf)
For the reference manual for the package see (http://cran.r-project.org/web/packages/arules/arules.pdf)
The Groceries data set contains 1 month (30 days) of real-world point-of-sale transaction data from a typical local grocery outlet. The data set contains 9835 transactions and the items are aggregated to 169 categories.
library(arules)
data("Groceries")
#run summary report
summary(Groceries)
## transactions as itemMatrix in sparse format with
## 9835 rows (elements/itemsets/transactions) and
## 169 columns (items) and a density of 0.02609146
##
## most frequent items:
## whole milk other vegetables rolls/buns soda
## 2513 1903 1809 1715
## yogurt (Other)
## 1372 34055
##
## element (itemset/transaction) length distribution:
## sizes
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
## 2159 1643 1299 1005 855 645 545 438 350 246 182 117 78 77 55
## 16 17 18 19 20 21 22 23 24 26 27 28 29 32
## 46 29 14 14 9 11 4 6 1 1 1 1 3 1
##
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 2.000 3.000 4.409 6.000 32.000
##
## includes extended item information - examples:
## labels level2 level1
## 1 frankfurter sausage meat and sausage
## 2 sausage sausage meat and sausage
## 3 liver loaf sausage meat and sausage
summary() displays the most frequent items in the data set, information about the transaction length distribution, and the fact that the data set carries extended item information: each item is annotated with two levels of category labels (level1 and level2). This additional information can be used for analyzing the data set.
To find the very long transactions, we can use size() and select the transactions containing more than 25 items.
x = Groceries[size(Groceries) > 25]
inspect(x)
## items
## [1] {sausage,
## finished products,
## tropical fruit,
## pip fruit,
## other vegetables,
## butter,
## curd,
## dessert,
## butter milk,
## yogurt,
## whipped/sour cream,
## soft cheese,
## frozen vegetables,
## rolls/buns,
## white bread,
## brown bread,
## pastry,
## pasta,
## soups,
## baby food,
## fruit/vegetable juice,
## salty snack,
## waffles,
## cake bar,
## chocolate,
## shopping bags}
## [2] {frankfurter,
## sausage,
## liver loaf,
## ham,
## chicken,
## beef,
## citrus fruit,
## tropical fruit,
## root vegetables,
## other vegetables,
## whole milk,
## butter,
## curd,
## yogurt,
## whipped/sour cream,
## beverages,
## soft cheese,
## hard cheese,
## cream cheese ,
## mayonnaise,
## domestic eggs,
## rolls/buns,
## roll products ,
## flour,
## pasta,
## margarine,
## specialty fat,
## sugar,
## soups,
## skin care,
## hygiene articles,
## candles}
## [3] {meat,
## turkey,
## tropical fruit,
## root vegetables,
## onions,
## other vegetables,
## whole milk,
## butter,
## yogurt,
## whipped/sour cream,
## sliced cheese,
## hard cheese,
## frozen meals,
## frozen dessert,
## domestic eggs,
## white bread,
## salt,
## rice,
## oil,
## margarine,
## mustard,
## baking powder,
## dog food,
## bottled water,
## fruit/vegetable juice,
## rum,
## long life bakery product,
## chocolate,
## cooking chocolate}
## [4] {beef,
## hamburger meat,
## tropical fruit,
## root vegetables,
## onions,
## other vegetables,
## whole milk,
## butter,
## yogurt,
## whipped/sour cream,
## sliced cheese,
## mayonnaise,
## frozen vegetables,
## frozen potato products,
## rolls/buns,
## white bread,
## rice,
## oil,
## specialty fat,
## specialty vegetables,
## canned fish,
## coffee,
## instant coffee,
## bottled beer,
## liquor (appetizer),
## long life bakery product,
## chocolate,
## napkins,
## house keeping products}
## [5] {frankfurter,
## sausage,
## liver loaf,
## beef,
## tropical fruit,
## pip fruit,
## root vegetables,
## onions,
## whole milk,
## butter,
## yogurt,
## whipped/sour cream,
## sliced cheese,
## cream cheese ,
## frozen meals,
## frozen potato products,
## domestic eggs,
## rolls/buns,
## roll products ,
## salt,
## rice,
## margarine,
## jam,
## coffee,
## long life bakery product,
## waffles,
## hygiene articles}
## [6] {sausage,
## pip fruit,
## root vegetables,
## onions,
## whole milk,
## butter,
## dessert,
## soft cheese,
## cream cheese ,
## specialty cheese,
## domestic eggs,
## salt,
## pasta,
## vinegar,
## oil,
## mustard,
## ketchup,
## canned vegetables,
## pickled vegetables,
## canned fish,
## instant coffee,
## tea,
## misc. beverages,
## bottled beer,
## white wine,
## chocolate,
## abrasive cleaner,
## hygiene articles}
## [7] {sausage,
## pork,
## citrus fruit,
## tropical fruit,
## pip fruit,
## root vegetables,
## whole milk,
## butter,
## curd,
## yogurt,
## cream,
## sliced cheese,
## hard cheese,
## frozen vegetables,
## frozen fish,
## frozen potato products,
## brown bread,
## margarine,
## baking powder,
## cat food,
## pet care,
## soda,
## fruit/vegetable juice,
## nut snack,
## chocolate,
## female sanitary products,
## hygiene articles,
## napkins,
## house keeping products}
To see which items are important in the data set we can use the itemFrequencyPlot(). To reduce the number of items, we only plot the item frequency for items with a support greater than 10%. The label size is reduced with the parameter cex.names.
itemFrequencyPlot(Groceries, support = 0.1, cex.names=0.8)
Use the apriori() algorithm to find all rules (the default association type for apriori()) with a minimum support of 0.3% and a minimum confidence of 0.5.
# Run the apriori algorithm
basket_rules <- apriori(Groceries,parameter = list(sup = 0.003, conf = 0.5,target="rules"))
## Apriori
##
## Parameter specification:
## confidence minval smax arem aval originalSupport maxtime support minlen
## 0.5 0.1 1 none FALSE TRUE 5 0.003 1
## maxlen target ext
## 10 rules FALSE
##
## Algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 2 TRUE
##
## Absolute minimum support count: 29
##
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[169 item(s), 9835 transaction(s)] done [0.01s].
## sorting and recoding items ... [136 item(s)] done [0.00s].
## creating transaction tree ... done [0.01s].
## checking subsets of size 1 2 3 4 5 done [0.01s].
## writing ... [421 rule(s)] done [0.00s].
## creating S4 object ... done [0.00s].
summary(basket_rules)
## set of 421 rules
##
## rule length distribution (lhs + rhs):sizes
## 2 3 4 5
## 5 281 128 7
##
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.000 3.000 3.000 3.325 4.000 5.000
##
## summary of quality measures:
## support confidence lift
## Min. :0.003050 Min. :0.5000 Min. :1.957
## 1st Qu.:0.003355 1st Qu.:0.5238 1st Qu.:2.135
## Median :0.003965 Median :0.5556 Median :2.426
## Mean :0.004754 Mean :0.5715 Mean :2.522
## 3rd Qu.:0.005186 3rd Qu.:0.6094 3rd Qu.:2.766
## Max. :0.022267 Max. :0.8857 Max. :5.804
##
## mining info:
## data ntransactions support confidence
## Groceries 9835 0.003 0.5
Recall from class:
Support: The support of an itemset \(X\), written \(supp(X)\), is defined as the proportion of transactions in the data set which contain the itemset. In the Groceries data, the support for the rules found above is relatively low, with a maximum support of about 2.2%.
Confidence: The confidence of a rule is defined as \(conf(X \Rightarrow Y) = supp(X \cup Y)/supp(X)\). For example, if the rule {milk, bread} \(\Rightarrow\) {butter} has a confidence of 0.5, then for 50% of the transactions containing milk and bread the rule is correct. Confidence can be interpreted as an estimate of the conditional probability \(P(Y|X)\), the probability of finding the RHS of the rule in transactions under the condition that these transactions also contain the LHS. Association rules are required to satisfy both a minimum support and a minimum confidence constraint at the same time.
Lift: Lift is a popular measure used to filter or rank discovered rules. The lift of a rule is defined as \(lift(X \Rightarrow Y) = supp(X \cup Y)/(supp(X)\,supp(Y))\). Lift can be interpreted as the deviation of the support of the whole rule from the support expected under independence, given the supports of the LHS and the RHS. Greater lift values indicate stronger associations.
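The three measures can be verified by hand on a toy set of transactions (invented data, used only to fix the definitions; arules computes the same quantities internally):

```r
# five toy transactions
trans <- list(c("milk", "bread", "butter"),
              c("milk", "bread"),
              c("bread", "butter"),
              c("milk", "butter"),
              c("milk", "bread", "jam"))

# support: fraction of transactions containing all the given items
supp <- function(items) mean(sapply(trans, function(t) all(items %in% t)))

s_xy <- supp(c("milk", "bread", "butter"))  # supp(X and Y) for {milk,bread} => {butter}
conf <- s_xy / supp(c("milk", "bread"))     # = (1/5) / (3/5) = 1/3
lift <- conf / supp("butter")               # = (1/3) / (3/5) = 5/9
c(support = s_xy, confidence = conf, lift = lift)
```

Here lift is below 1, meaning butter actually co-occurs with {milk, bread} less often than independence would predict in this toy data.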
# Check the generated rules using inspect
inspect(head(basket_rules))
## lhs rhs support confidence
## [1] {cereals} => {whole milk} 0.003660397 0.6428571
## [2] {specialty cheese} => {other vegetables} 0.004270463 0.5000000
## [3] {rice} => {other vegetables} 0.003965430 0.5200000
## [4] {rice} => {whole milk} 0.004677173 0.6133333
## [5] {baking powder} => {whole milk} 0.009252669 0.5229885
## [6] {root vegetables,herbs} => {other vegetables} 0.003863752 0.5507246
## lift
## [1] 2.515917
## [2] 2.584078
## [3] 2.687441
## [4] 2.400371
## [5] 2.046793
## [6] 2.846231
As is typical for association rule mining, the number of rules found is large. To analyze these rules, subset() can, for example, be used to produce separate subsets of rules. Now find the subset of rules whose total length (lhs + rhs) is greater than 4.
#Basket rules of size greater than 4
inspect(subset(basket_rules, size(basket_rules)>4))
## lhs rhs support confidence lift
## [1] {citrus fruit,
## tropical fruit,
## root vegetables,
## other vegetables} => {whole milk} 0.003152008 0.7045455 2.757344
## [2] {citrus fruit,
## tropical fruit,
## root vegetables,
## whole milk} => {other vegetables} 0.003152008 0.8857143 4.577509
## [3] {citrus fruit,
## tropical fruit,
## other vegetables,
## whole milk} => {root vegetables} 0.003152008 0.6326531 5.804238
## [4] {citrus fruit,
## root vegetables,
## other vegetables,
## whole milk} => {tropical fruit} 0.003152008 0.5438596 5.183004
## [5] {tropical fruit,
## root vegetables,
## other vegetables,
## yogurt} => {whole milk} 0.003558719 0.7142857 2.795464
## [6] {tropical fruit,
## root vegetables,
## whole milk,
## yogurt} => {other vegetables} 0.003558719 0.6250000 3.230097
## [7] {tropical fruit,
## root vegetables,
## other vegetables,
## whole milk} => {yogurt} 0.003558719 0.5072464 3.636128
Find the subset of rules with lift greater than 5:
inspect(subset(basket_rules, lift>5))
## lhs rhs support confidence lift
## [1] {citrus fruit,
## tropical fruit,
## other vegetables,
## whole milk} => {root vegetables} 0.003152008 0.6326531 5.804238
## [2] {citrus fruit,
## root vegetables,
## other vegetables,
## whole milk} => {tropical fruit} 0.003152008 0.5438596 5.183004
Now find the subset of rules that have yogurt on the right-hand side. Here we require the lift measure to exceed 3.5.
yogurt.rhs <- subset(basket_rules, subset = rhs %in% "yogurt" & lift>3.5)
Now inspect the subset of rules:
inspect(yogurt.rhs)
## lhs rhs support confidence lift
## [1] {whipped/sour cream,
## cream cheese } => {yogurt} 0.003355363 0.5238095 3.754859
## [2] {root vegetables,
## cream cheese } => {yogurt} 0.003762074 0.5000000 3.584184
## [3] {tropical fruit,
## curd} => {yogurt} 0.005287239 0.5148515 3.690645
## [4] {other vegetables,
## whole milk,
## cream cheese } => {yogurt} 0.003457041 0.5151515 3.692795
## [5] {tropical fruit,
## whole milk,
## curd} => {yogurt} 0.003965430 0.6093750 4.368224
## [6] {tropical fruit,
## other vegetables,
## butter} => {yogurt} 0.003050330 0.5555556 3.982426
## [7] {tropical fruit,
## whole milk,
## butter} => {yogurt} 0.003355363 0.5409836 3.877969
## [8] {tropical fruit,
## whole milk,
## whipped/sour cream} => {yogurt} 0.004372140 0.5512821 3.951792
## [9] {tropical fruit,
## root vegetables,
## other vegetables,
## whole milk} => {yogurt} 0.003558719 0.5072464 3.636128
Now find the subset of rules that have meat on the left-hand side. Here we require the lift measure to exceed 1.5.
meat.lhs <- subset(basket_rules, subset = lhs %in% "meat" & lift>1.5)
Now inspect the subset of rules:
inspect(meat.lhs)
## lhs rhs support confidence lift
## [1] {meat,root vegetables} => {whole milk} 0.003152008 0.62 2.426462
We can use the arulesViz package to visualize the rules, for a more complete introduction see (http://cran.r-project.org/web/packages/arulesViz/vignettes/arulesViz.pdf).
install.packages('arulesViz')
library('arulesViz')
plot(basket_rules)
The plot function has an interactive mode for you to inspect individual rules:
plot(basket_rules, interactive=TRUE)
Graph-based visualization can be used for very small sets of rules. The vertices are represented by items for the 10 rules with highest lift:
plot(head(sort(basket_rules, by="lift"), 10), method = "graph")
The package comes with an approach to cluster association rules and itemsets:
plot(basket_rules, method="grouped")
For problem 1 (the iris data), simply use the built-in iris dataset in R. When doing cluster analysis on iris, you'll want to ignore the Species variable.
data(iris)
For problem 2 (the Cincinnati Zoo data), use the following code to load the transaction data for association rule mining. The as() function coerces the data set into the transactions type required for association rule mining.
TransFood <- read.csv('https://xiaoruizhu.github.io/Data-Mining-R/data/food_4_association.csv')
TransFood <- TransFood[, -1]
TransFood <- as(as.matrix(TransFood), "transactions")
Load the data for clustering:
Food_by_month <- read.csv('https://xiaoruizhu.github.io/Data-Mining-R/data/qry_Food_by_Month.csv')